Lab assignment: fraud detection through ensemble methods

In this assignment we will use all the skills in ensemble learning we acquire from previous exercises to build a an automated fraud detection system.

Guidelines

Throughout this notebook you will find empty cells that you will need to fill with your own code. Follow the instructions in the notebook and pay special attention to the following symbols.

You will need to solve a question by writing your own code or answer in the cell immediately below, or in a different file as instructed. Both correctness of the solution and code quality will be taken into account for marking.
This is a hint or useful observation that can help you solve this assignment. You are not expected to write any solution, but you should pay attention to them to understand the assignment.
This is an advanced and voluntary excercise that can help you gain a deeper knowledge into the topic. This exercise won't be taken into account towards marking, but you are encouraged to undertake it. Good luck!

To avoid missing packages and compatibility issues you should run this notebook under one of the recommended Ensembles environment files.

The following code will embed any plots into the notebook instead of generating a new window:

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Lastly, if you need any help on the usage of a Python function you can place the writing cursor over its name and press Caps+Shift to produce a pop-out with related documentation. This will only work inside code cells.

Let's go!

Data loading

The data for this problem is included in the data folder, with separate files for training and test data. Each file includes several unidentified explanatory features, together with an "Amount" feature and the target "Class". Fraudulent operations are marked as Class == 1.

In [2]:
import pandas as pd
import numpy as np
Load the training and test data into Pandas DataFrames with names train and test, respectively.
In [3]:
####### INSERT YOUR CODE HERE

train = pd.read_csv('./data/fraud_train.csv', encoding = 'utf-8')
test = pd.read_csv('./data/fraud_test.csv', encoding = 'utf-8')
Analyze the training data. How many explanatory variables do you have? What is the distribution of classes?
In [4]:
train.shape
Out[4]:
(5246, 30)
In [5]:
test.shape
Out[5]:
(5246, 30)
In [6]:
####### INSERT YOUR CODE HERE
sns.pairplot(train, hue = "Class")
Out[6]:
<seaborn.axisgrid.PairGrid at 0x1f243fb79b0>

En el gráfico de arriba podemos ver las correlaciones entre todas las variables, así como la división (en algunos casos, pues en otros scatterplot vemos las dos clases juntas) entre las operaciones en las que se hizo fraude y aquellas en las que no.

In [7]:
train.describe()
Out[7]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
count 5246.000000 5246.000000 5246.000000 5246.000000 5246.000000 5246.000000 5246.000000 5246.000000 5246.000000 5246.000000 ... 5246.000000 5246.000000 5246.000000 5246.000000 5246.000000 5246.000000 5246.000000 5246.000000 5246.000000 5246.000000
mean -0.539993 0.182329 0.302563 0.308565 -0.470484 -0.037538 -0.397722 0.131348 -0.212005 -0.269800 ... 0.010017 -0.132105 -0.051927 0.010908 0.133937 0.019680 0.023637 -0.001524 93.582991 0.046893
std 2.700370 2.111652 2.698561 1.715353 2.036673 1.411958 2.273415 2.027100 1.322918 1.863553 ... 1.154418 0.696071 0.748970 0.583605 0.470548 0.491799 0.460188 0.350230 250.696936 0.211430
min -34.591213 -44.639245 -31.103685 -5.519697 -32.092129 -21.248752 -21.922811 -37.353443 -9.283925 -18.271168 ... -12.815353 -8.887017 -26.751119 -2.185457 -7.495741 -1.345640 -7.144717 -8.364853 0.000000 0.000000
25% -1.160978 -0.556885 0.070103 -0.719923 -0.999000 -0.734639 -0.673759 -0.137595 -0.803361 -0.566849 ... -0.227437 -0.549349 -0.178857 -0.332102 -0.136150 -0.328017 -0.061094 -0.007941 5.102500 0.000000
50% -0.341215 0.142323 0.692239 0.214909 -0.360585 -0.226545 -0.100690 0.079762 -0.173325 -0.116838 ... -0.053649 -0.103794 -0.045147 0.068361 0.168982 -0.080875 0.015344 0.023243 22.190000 0.000000
75% 1.157161 0.867764 1.357493 1.089141 0.211143 0.400486 0.395594 0.401846 0.454970 0.422938 ... 0.126478 0.299756 0.087792 0.402059 0.434307 0.289032 0.102887 0.083911 81.665000 0.000000
max 1.618082 16.713389 3.971381 11.927512 31.457046 21.393069 34.303177 20.007208 7.938980 11.519106 ... 27.202839 4.534454 5.303607 3.979637 2.208209 2.964300 4.444505 5.414028 7712.430000 1.000000

8 rows × 30 columns

In [8]:
train.dtypes
Out[8]:
V1        float64
V2        float64
V3        float64
V4        float64
V5        float64
V6        float64
V7        float64
V8        float64
V9        float64
V10       float64
V11       float64
V12       float64
V13       float64
V14       float64
V15       float64
V16       float64
V17       float64
V18       float64
V19       float64
V20       float64
V21       float64
V22       float64
V23       float64
V24       float64
V25       float64
V26       float64
V27       float64
V28       float64
Amount    float64
Class       int64
dtype: object
In [9]:
train.isnull().sum()
Out[9]:
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

Como vemos, no tenemos valores ausentes en ninguna variable.

Sólo el 4% de nuestras observaciones pertenecen a la clase 1 (hace fraude). Esto seguramente dificultará el análisis, pues las clases no están balanceadas, y en cualquier clasificador podrá obtener un error muy pequeño prediciendo siempre que no va a hacer fraude el cliente. Tenemos 5246 observaciones; es decir no demasiadas, por lo que tendremos que tener cuidado a la hora de utilizar modelos demasiado complejos (Random Forest o XGB con demasiados estimadores con respecto al número de observaciones).

Lo primero que nos llama la atención de lo que vemos es el gran número de variables que tenemos. Se observan algunas relaciones lineales bastante claras entre las variables, y otras no tan lineales ni tan fuertes. Vamos a analizar la variable objetivo.

In [10]:
sns.countplot(x='Class',data=train, palette='hls')
plt.show()

En el gráfico superior podemos ver que las clases están muy desbalanceadas, como ya comentábamos antes. Al graficarlo podemos hacernos una idea visual de dicho desbalanceo.

In [11]:
plt.rcParams['figure.figsize'] = (10, 10)
sns.distplot(train['Amount'])
plt.show()
C:\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6521: MatplotlibDeprecationWarning: 
The 'normed' kwarg was deprecated in Matplotlib 2.1 and will be removed in 3.1. Use 'density' instead.
  alternative="'density'", removal="3.1")

Vemos que hay verdaderamente pocas operaciones con una cantidad (Amount) superior a 1000 dólares.

Unsupervised fraud detector

Fraudulent activities are usually prosecuted, therefore fraudsters need to be creative and come up constantly with new ways of performing fraud. Furthermore, frauds are scarce (fortunately), and so we have few positive class patterns available for training. Because of these facts, it might make sense to build an unsupervised fraud detector.

Dentro del aprendizaje no supervisado, podemos encontrar diferentes métodos o modelos que se suelen utilizar para la detección de anomalías, y en concreto en el caso de uso del fraude. Dos ejemplos son este y este. Un artículo muy interesante sobre la detección de fraude mediante el uso de diferentes algoritmos de Machine Learning lo podemos encontrar aquí; en él podemos inspirarnos sobre algunos de los modelos que podemos utilizar para contrastar los resultados del Isolation Forest. Este es el modelo en el que más nos centraremos pues el objetivo es optimizar el aprendizaje de ensembles, pero esto nos servirá para contrastar mejor los resultados y poder darles una proporcionalidad a las magnitudes de error que encontremos. De otra forma no tendríamos manera de saber cómo de bien lo estamos haciendo con respecto a una muestra de lo que se podría hacer alternativamente, sólo sabríamos cómo lo estamos haciendo con respecto a Isolation Forests con otros hiperparámetros.

Una forma de afrontar este tipo de problemas sería mediante clustering; otras, están basadas en la desviación local de densidad como LOF (Local Outlier Factor), o en Hidden Markov Models. Por otro lado, tenemos las One Class Support Vector Machines, algoritmos modificados de la SVM en los que, mediante aprendizaje no supervisado, tratamos de detectar anomalías y outliers. En última instancia, de los más importantes y que más se repiten en los trabajos relativos a esta materia, tenemos los AutoEncoders, un tipo de algoritmo de Deep Learning, también no supervisado, que se utiliza a menudo para la reducción de la dimensión, así como para la detección de anomalías en los datos y por lo tanto también para luchar contra el fraude.

In [12]:
def RocPlot(true, **kwargs):
    plt.figure(figsize=(15, 5))
    for model in kwargs:
        fpr, tpr, _ = roc_curve(true, kwargs[model])
        roc = roc_auc_score(true, kwargs[model])
        plt.plot(fpr, tpr, lw=2, label='%s (area = %0.3f)' % (model, roc))
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend(loc="lower right")
In [13]:
def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Greens):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    #classes = classes[unique_labels(y_true, y_pred)]
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax

OneClass SVM

In [14]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_auc_score, make_scorer, roc_curve, recall_score, classification_report, confusion_matrix
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib as mpl
from matplotlib.colors import ListedColormap
from sklearn.neighbors import LocalOutlierFactor
from yellowbrick.classifier import ClassificationReport
In [15]:
X_train, y_train = train.drop('Class', axis = 1), train['Class']
X_test, y_test = test.drop('Class', axis = 1), test['Class']
contam = X_train.loc[y_train == 1, :].shape[0]/X_train.shape[0]
In [16]:
res = {}
In [17]:
ocsvmparams = {'kernel':['rbf', 'linear'],
              'gamma':['scale', 'auto', 0.1],
              'nu':[0.05, 0.95*contam],
              'random_state':[69],
              'verbose':[True]}

skfold = StratifiedKFold(n_splits = 3)

folds = list(skfold.split(X_train, y_train))

ocsvmgrid = GridSearchCV(estimator = OneClassSVM(), 
                         param_grid = ocsvmparams, 
                         scoring = 'roc_auc', n_jobs = -1, 
                        cv = folds, verbose = 1)
In [18]:
ocsvmgrid.fit(X_train, y_train)
Fitting 3 folds for each of 12 candidates, totalling 36 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[LibSVM]
[Parallel(n_jobs=-1)]: Done  36 out of  36 | elapsed:   11.5s finished
C:\Anaconda3\lib\site-packages\sklearn\svm\classes.py:1175: DeprecationWarning: The random_state parameter is deprecated and will be removed in version 0.22.
  " be removed in version 0.22.", DeprecationWarning)
Out[18]:
GridSearchCV(cv=[(array([  82,   83, ..., 5244, 5245]), array([   0,    1, ..., 1911, 1912])), (array([   0,    1, ..., 5244, 5245]), array([  82,   83, ..., 3578, 3579])), (array([   0,    1, ..., 3578, 3579]), array([ 164,  165, ..., 5244, 5245]))],
       error_score='raise-deprecating',
       estimator=OneClassSVM(cache_size=200, coef0=0.0, degree=3, gamma='auto_deprecated',
      kernel='rbf', max_iter=-1, nu=0.5, random_state=None, shrinking=True,
      tol=0.001, verbose=False),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'kernel': ['rbf', 'linear'], 'gamma': ['scale', 'auto', 0.1], 'nu': [0.05, 0.04454822722073961], 'random_state': [69], 'verbose': [True]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=1)
In [19]:
ocsvmgrid.best_estimator_
Out[19]:
OneClassSVM(cache_size=200, coef0=0.0, degree=3, gamma='scale',
      kernel='linear', max_iter=-1, nu=0.05, random_state=69,
      shrinking=True, tol=0.001, verbose=True)
In [20]:
preds = ocsvmgrid.predict(X_test)

preds = [1 if pred==-1 else 0 for pred in preds]

res['SVC'] = {}

res['SVC']['preds'] = preds
In [21]:
print(classification_report(y_test, preds))
              precision    recall  f1-score   support

           0       0.95      0.87      0.91      5000
           1       0.03      0.10      0.05       246

   micro avg       0.83      0.83      0.83      5246
   macro avg       0.49      0.48      0.48      5246
weighted avg       0.91      0.83      0.87      5246

In [22]:
confusion_matrix(y_test, preds)
Out[22]:
array([[4327,  673],
       [ 222,   24]], dtype=int64)

Como vemos tanto en el classification report, como en la matriz de confusión, como en el gráfico del ROC, este modelo es bastante malo.

In [23]:
RocPlot(y_test, model= -ocsvmgrid.decision_function(X_test))
In [24]:
plot_confusion_matrix(y_true=y_test, y_pred= preds, classes=[0,1],
                          normalize=True,
                          title='Normalized Confusion Matrix for SVC',
                          cmap=plt.cm.Greens)
Normalized confusion matrix
[[0.8654     0.1346    ]
 [0.90243902 0.09756098]]
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f27dc1eac8>

Isolation Forest

Primero vamos a probar un IsolationForest simple, sin hacer nada de hyperparameter tuning, solo para ver cómo se comporta el algoritmo y, sin modificarle demasiado, qué tal detecta el fraude.

In [25]:
iso = IsolationForest(contamination=train[train['Class']==1].shape[0]/train.shape[0])
In [26]:
iso.fit(X_train)
C:\Anaconda3\lib\site-packages\sklearn\ensemble\iforest.py:223: FutureWarning: behaviour="old" is deprecated and will be removed in version 0.22. Please use behaviour="new", which makes the decision_function change to match other anomaly detection algorithm API.
  FutureWarning)
Out[26]:
IsolationForest(behaviour='old', bootstrap=False,
        contamination=0.04689287075867327, max_features=1.0,
        max_samples='auto', n_estimators=100, n_jobs=None,
        random_state=None, verbose=0)
In [27]:
preds = iso.predict(X_test)
C:\Anaconda3\lib\site-packages\sklearn\ensemble\iforest.py:417: DeprecationWarning: threshold_ attribute is deprecated in 0.20 and will be removed in 0.22.
  " be removed in 0.22.", DeprecationWarning)
In [28]:
preds = [0 if pred==1 else 1 for pred in preds]
In [29]:
RocPlot(y_test, model = -iso.decision_function(X_test))
Using only the training data, create an anomaly detection model. You should also choose an error metric adequate for the problem, and tune the model parameters in order to optimize this error.

Using Recall and AUC as Scoring Metrics

In [30]:
X_train, y_train = train.drop('Class', axis = 1), train['Class']
X_test, y_test = test.drop('Class', axis = 1), test['Class']
In [31]:
param_grid = {'n_estimators':[200, 500, 800, 1000],
             'max_samples':['auto', 0.7], 
             'max_features':[1.0, 0.9, 0.8],
             'bootstrap':[True, False],
             'n_jobs':[-1],
             'random_state':[69],
             'verbose':[1], 
             'contamination':[0.04, 0.10, 0.20]}

scores = {'Recall':make_scorer(recall_score,greater_is_better = True, pos_label = -1, average='macro'), 'AUC': 'roc_auc'}

skfold = StratifiedKFold(n_splits = 3)

folds = list(skfold.split(X_train, y_train))
In [32]:
isogrid = GridSearchCV(estimator = IsolationForest(behaviour='new'), param_grid=param_grid, 
                        scoring = scores,
                            n_jobs = -1, cv = folds, verbose = 1,refit = 'Recall')
In [33]:
np.unique(y_train)
Out[33]:
array([0, 1], dtype=int64)
In [34]:
isogrid.fit(X_train, y_train)
Fitting 3 folds for each of 144 candidates, totalling 432 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  8.9min
[Parallel(n_jobs=-1)]: Done 432 out of 432 | elapsed: 19.8min finished
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    0.1s remaining:    0.1s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    0.1s finished
Out[34]:
GridSearchCV(cv=[(array([  82,   83, ..., 5244, 5245]), array([   0,    1, ..., 1911, 1912])), (array([   0,    1, ..., 5244, 5245]), array([  82,   83, ..., 3578, 3579])), (array([   0,    1, ..., 3578, 3579]), array([ 164,  165, ..., 5244, 5245]))],
       error_score='raise-deprecating',
       estimator=IsolationForest(behaviour='new', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples='auto', n_estimators=100,
        n_jobs=None, random_state=None, verbose=0),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'n_estimators': [200, 500, 800, 1000], 'max_samples': ['auto', 0.7], 'max_features': [1.0, 0.9, 0.8], 'bootstrap': [True, False], 'n_jobs': [-1], 'random_state': [69], 'verbose': [1], 'contamination': [0.04, 0.1, 0.2]},
       pre_dispatch='2*n_jobs', refit='Recall', return_train_score='warn',
       scoring={'Recall': make_scorer(recall_score, pos_label=-1, average=macro), 'AUC': 'roc_auc'},
       verbose=1)
In [35]:
isogrid.best_estimator_
Out[35]:
IsolationForest(behaviour='new', bootstrap=True, contamination=0.04,
        max_features=0.8, max_samples='auto', n_estimators=200, n_jobs=-1,
        random_state=69, verbose=1)
In [36]:
preds = isogrid.predict(X_test)

preds = [0 if pred==1 else 1 for pred in preds]

res['ISOFO1'] = {}

res['ISOFO1']['preds'] = preds
In [37]:
confusion_matrix(y_test, preds)
Out[37]:
array([[4907,   93],
       [ 113,  133]], dtype=int64)
In [38]:
print(classification_report(y_test, preds))
              precision    recall  f1-score   support

           0       0.98      0.98      0.98      5000
           1       0.59      0.54      0.56       246

   micro avg       0.96      0.96      0.96      5246
   macro avg       0.78      0.76      0.77      5246
weighted avg       0.96      0.96      0.96      5246

In [39]:
plot_confusion_matrix(y_true=y_test, y_pred= preds, classes=[0,1],
                          normalize=True,
                          title='Normalized Confusion Matrix for Isolation Forest 1',
                          cmap=plt.cm.Greens)
Normalized confusion matrix
[[0.9814     0.0186    ]
 [0.45934959 0.54065041]]
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f200149588>
In [40]:
isogrid.decision_function(X_test)
Out[40]:
array([-0.04304269, -0.04283613, -0.04278415, ...,  0.12641077,
        0.11555498,  0.11235019])
In [41]:
RocPlot(y_test, model = -isogrid.decision_function(X_test))

Vamos a intentar mejorar el Isolation Forest de arriba utilizando como scoring el AUC en lugar de el Recall y el AUC. Además, le introduciremos la contaminación real de la muestra, dejando éste de ser un parámetro a optimizar.

Using only AUC as Scoring Metric and Adding Real Contamination as a Parameter

In [42]:
contam
Out[42]:
0.04689287075867327
In [43]:
param_grid = {'n_estimators':[500, 800, 1000],
             'max_samples':['auto', 0.7], 
             'max_features':[1.0, 0.9, 0.8],
             'bootstrap':[True, False],
             'n_jobs':[-1],
             'random_state':[69],
             'verbose':[1]}

score = 'roc_auc'

skfold = StratifiedKFold(n_splits = 3)

folds = list(skfold.split(X_train, y_train))
In [44]:
isogrid2 = GridSearchCV(estimator = IsolationForest(behaviour='new', contamination=contam), param_grid=param_grid, 
                        scoring = score,
                            n_jobs = -1, cv = folds, verbose = 1)
In [45]:
isogrid2.fit(X_train, y_train)
Fitting 3 folds for each of 36 candidates, totalling 108 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed:  4.0min finished
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    0.5s remaining:    0.5s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    0.5s finished
Out[45]:
GridSearchCV(cv=[(array([  82,   83, ..., 5244, 5245]), array([   0,    1, ..., 1911, 1912])), (array([   0,    1, ..., 5244, 5245]), array([  82,   83, ..., 3578, 3579])), (array([   0,    1, ..., 3578, 3579]), array([ 164,  165, ..., 5244, 5245]))],
       error_score='raise-deprecating',
       estimator=IsolationForest(behaviour='new', bootstrap=False,
        contamination=0.04689287075867327, max_features=1.0,
        max_samples='auto', n_estimators=100, n_jobs=None,
        random_state=None, verbose=0),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'n_estimators': [500, 800, 1000], 'max_samples': ['auto', 0.7], 'max_features': [1.0, 0.9, 0.8], 'bootstrap': [True, False], 'n_jobs': [-1], 'random_state': [69], 'verbose': [1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=1)
In [46]:
preds = isogrid2.predict(X_test)

preds = [0 if pred==1 else 1 for pred in preds]

res['ISOFO2'] = {}

res['ISOFO2']['preds'] = preds
In [47]:
print(classification_report(y_test, preds))
              precision    recall  f1-score   support

           0       0.98      0.98      0.98      5000
           1       0.57      0.64      0.60       246

   micro avg       0.96      0.96      0.96      5246
   macro avg       0.78      0.81      0.79      5246
weighted avg       0.96      0.96      0.96      5246

Vemos que en este caso sube un poco el recall del fraude, así como el f1-score

In [48]:
plot_confusion_matrix(y_true=y_test, y_pred= preds, classes=[0,1],
                          normalize=True,
                          title='Normalized Confusion Matrix for Isolation Forest 2',
                          cmap=plt.cm.Greens)
Normalized confusion matrix
[[0.9764     0.0236    ]
 [0.36178862 0.63821138]]
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f2006c3c50>

Pero sigue habiendo 118 casos en los que dijimos que no había fraude y sin embargo sí lo había.

In [49]:
RocPlot(y_test, model = -isogrid2.decision_function(X_test))

Mejora un poco el AUC con respecto al modelo anterior. Podríamos, con el fin de predecir más True Positives (1 verdaderos), sesgar al modelo un poco, haciéndole fallar en algunos 0s (es decir, prediciendo como 1 observaciones en las que no hubo fraude), ya que estos casos son menos dañinos para la compañía que los casos en los que predecimos que no hay fraude y sin embargo sí lo hay.

Otra de las acciones que podríamos llevar a cabo para tratar de mejorar el IsolationForest es escalar todas las variables (excepto la clase, evidentemente, pues ésta ya está en binario {0,1}). Esto lo aplicaremos a todos los algoritmos que utilicemos más adelante.

Scaling the Features

In [50]:
clasetr = train['Class']

clasete = test['Class']

scaler = StandardScaler()

scaled_train = scaler.fit_transform(train.drop('Class', axis = 1))

X_train = scaled_train

y_train = clasetr

scaled_test = scaler.fit_transform(test.drop('Class', axis = 1))

X_test = scaled_test

y_test = clasete
In [51]:
contam = X_train[y_train == 1, :].shape[0]/X_train.shape[0]
In [52]:
#y_train = np.array(y_train).reshape(-1, 1)

#y_test = np.array(y_test).reshape(-1, 1)
In [53]:
param_grid = {'n_estimators':[500, 800, 1000],
             'max_samples':['auto', 0.7, 0.5], 
             'max_features':[1.0, 0.5, 0.8],
             'bootstrap':[True],
             'n_jobs':[-1],
             'random_state':[69],
             'verbose':[1]}

score = 'roc_auc'

skfold = StratifiedKFold(n_splits = 3)

folds = list(skfold.split(X_train, y_train))

isogrid3 = GridSearchCV(estimator = IsolationForest(behaviour='new', contamination=contam), param_grid=param_grid, 
                        scoring = score,
                            n_jobs = -1, cv = folds, verbose = 1)
In [54]:
isogrid3.fit(X_train, y_train)
Fitting 3 folds for each of 27 candidates, totalling 81 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done  81 out of  81 | elapsed:  3.1min finished
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    0.4s remaining:    0.4s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    0.5s finished
Out[54]:
GridSearchCV(cv=[(array([  82,   83, ..., 5244, 5245]), array([   0,    1, ..., 1911, 1912])), (array([   0,    1, ..., 5244, 5245]), array([  82,   83, ..., 3578, 3579])), (array([   0,    1, ..., 3578, 3579]), array([ 164,  165, ..., 5244, 5245]))],
       error_score='raise-deprecating',
       estimator=IsolationForest(behaviour='new', bootstrap=False,
        contamination=0.04689287075867327, max_features=1.0,
        max_samples='auto', n_estimators=100, n_jobs=None,
        random_state=None, verbose=0),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'n_estimators': [500, 800, 1000], 'max_samples': ['auto', 0.7, 0.5], 'max_features': [1.0, 0.5, 0.8], 'bootstrap': [True], 'n_jobs': [-1], 'random_state': [69], 'verbose': [1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=1)
In [55]:
isogrid3.best_estimator_
Out[55]:
IsolationForest(behaviour='new', bootstrap=True,
        contamination=0.04689287075867327, max_features=0.8,
        max_samples=0.7, n_estimators=500, n_jobs=-1, random_state=69,
        verbose=1)
In [56]:
preds = isogrid3.predict(X_test)
In [57]:
preds = [0 if pred==1 else 1 for pred in preds]

res['ISOFO3'] = {}

res['ISOFO3']['preds'] = preds
In [58]:
plot_confusion_matrix(y_true=y_test, y_pred= preds, classes=[0,1],
                          normalize=True,
                          title='Normalized Confusion Matrix for Isolation Forest 3',
                          cmap=plt.cm.Greens)
Normalized confusion matrix
[[0.9766     0.0234    ]
 [0.33333333 0.66666667]]
Out[58]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f200629978>
In [59]:
print(classification_report(y_test, preds))
              precision    recall  f1-score   support

           0       0.98      0.98      0.98      5000
           1       0.58      0.67      0.62       246

   micro avg       0.96      0.96      0.96      5246
   macro avg       0.78      0.82      0.80      5246
weighted avg       0.96      0.96      0.96      5246

In [60]:
RocPlot(y_test, model = -isogrid3.decision_function(X_test))

Parece que ahora hemos conseguido mejorar un poco el modelo, aumentando el AUC y el Recall de los 1.

Cluster-Based Anomaly Detection

Simultaneous implementation of multiple clustering models

Aprovechando que tenemos un conjunto de entrenamiento con algunas clases predichas, vamos a diseñar una función que encuentre el mejor modelo de cluster y la mejor forma de asignar un cluster a la clase de operación fraudulenta. Por ello mezclaremos una técnica de aprendizaje no supervisado puro con el uso de etiquetas (hay que aprovechar siempre la información que se tiene).

In [61]:
models = ["KMeans", "GaussianMixture", "DBSCAN"]

scaler = StandardScaler()

var_name = "Class"
Xtrain = pd.DataFrame(scaler.fit_transform(train.drop(var_name, axis = 1)))

ytrain = train[var_name]
    
Xtest = pd.DataFrame(scaler.fit_transform(test.drop(var_name, axis = 1)))
ytest = test[var_name]
In [63]:
def get_fraudulent_cluster(models, train, test, max_clusters = 20, seed = 69, var_name = 'Class'):
    '''
    This function uses an unsupervised 
    Parameters:
        - model: the unsupervised model used to fit the data.
        - train: the train set (including the target if exist)
        - max_clusters: the maximum number of clusters to fit.
        - seed
        - var_name: the name of the Target variable (if exists)
    
    Returns:
        - results: a dictionary with the best models for each type of model. 
    '''
    
    scaler = StandardScaler()
    
    results = {}
    
    #fraud_index = train.index[train[var_name]==1].tolist()
    
    X = pd.DataFrame(scaler.fit_transform(train.drop(var_name, axis = 1)))
    
    Xtest, ytest = pd.DataFrame(scaler.fit_transform(test.drop(var_name, axis = 1))), test[var_name]
    
    for model in models:
        
        if model != "DBSCAN":
              
            best_clust = 0

            best_opt = 0

            for i in range(2, max_clusters):

                if model == 'KMeans':

                    m = KMeans(n_clusters=i, random_state = seed, max_iter = 200)

                    g = m.fit_predict(X)

                elif model == "GaussianMixture":

                    m = GaussianMixture(n_components=i, random_state = seed, max_iter = 200)

                    g = m.fit_predict(X)
                    
                train['group'] = g

                obtained = 0

                clus = 0

                for a in range(i):

                    if train[train['group'] == a][train[var_name] == 1].shape[0] > obtained:

                        obtained = train[train['group'] == a][train[var_name] == 1].shape[0]

                        clus = a

                if obtained > best_opt:

                    best_clust = clus

                    best_opt = obtained
            
            gr_test = m.predict(Xtest)
            
            gr_test = [1 if g==best_clust else 0 for g in gr_test]
            
            roc = roc_auc_score(ytest, gr_test)
            
            recall = recall_score(ytest, gr_test)
            
            _clust = {'model': m,
                    'clusters': best_clust,
                    'obtained': best_opt,
                     'predictions': gr_test,
                     'roc_auc_score': roc,
                     'recall_score': recall}

            results[model] = _clust
            
        elif model == "DBSCAN":
                
                m =  DBSCAN(eps = 0.5, min_samples = 10)
                
                g = m.fit_predict(X)
                                
                train['group'] = [1 if t == -1 else 0 for t in g]

                score = recall_score(y_true = train[var_name], y_pred = train['group'])
                
                pre = m.fit_predict(Xtest)
                
                pre = [1 if t == -1 else 0 for t in pre]
                
                recall = recall_score(ytest, pre)
                
                roc = roc_auc_score(ytest, pre)
                
                _clust = {'model':m,
                         'score': score,
                         'recall': recall,
                         'roc_auc_score': roc,
                         'predictions': pre}
                
                results[model] = _clust

        else:
            
            print("Model not available")
    
    return results
            
    
    
In [64]:
dic = get_fraudulent_cluster(models = models, train = train, test = test)
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:57: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:57: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:57: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:57: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:57: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:57: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:57: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:57: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:57: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:57: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:57: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:57: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:57: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:57: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:57: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:57: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:57: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:57: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:57: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:57: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:55: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
C:\Anaconda3\lib\site-pac
limit_output extension: Maximum message size of 10000 exceeded with 11571 characters
In [65]:
dic
Out[65]:
{'KMeans': {'model': KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=200,
      n_clusters=19, n_init=10, n_jobs=None, precompute_distances='auto',
      random_state=69, tol=0.0001, verbose=0),
  'clusters': 1,
  'obtained': 127,
  'predictions': [0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   1,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   1,
   0,
   0,
   1,
   0,
   1,
   0,
   1,
   0,
   0,
   0,
   1,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   1,
   1,
   0,
   0,
   0,
   1,
   0,
   0,
   1,
   0,
   0,
   0,
   1,
   0,
   0,
   1,
   1,
   1,
   0,
   0,
   0,
   1,
   0,
   1,
   1,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   1,
   1,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   1,
   1,
   1,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   1,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   1,
   1,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   1,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   1,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   1,
   1,
   0,
   1,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   1,
   1,
   0,
   0,
   0,
   1,
   1,
   0,
   0,
   0,
   0,
   1,
   1,
   1,
   0,
   0,
   0,
   1,
   0,
   0,
   1,
   1,
   0,
   0,
   1,
   0,
   0,
   0,
   1,
   0,
   0,
   1,
   0,
   0,
   1,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   1,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   1,
   0,
   0,
   1,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   1,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   1,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   1,
   1,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   1,
   1,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   1,
   0,
   1,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   1,
   0,
   0,
   1,
   1,
   0,
   0,
   1,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   ...],
  'roc_auc_score': 0.41376260162601625,
  'recall_score': 0.02032520325203252},
 'GaussianMixture': {'model': GaussianMixture(covariance_type='full', init_params='kmeans', max_iter=200,
          means_init=None, n_components=19, n_init=1, precisions_init=None,
          random_state=69, reg_covar=1e-06, tol=0.001, verbose=0,
          verbose_interval=10, warm_start=False, weights_init=None),
  'clusters': 0,
  'obtained': 223,
  'predictions': [0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   1,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0
limit_output extension: Maximum message size of 10000 exceeded with 19047 characters
In [66]:
plot_confusion_matrix(y_true=y_test, y_pred= dic['KMeans']['predictions'], classes=[0,1],
                          normalize=True,
                          title='Normalized Confusion Matrix for KMeans',
                          cmap=plt.cm.Greens)

plot_confusion_matrix(y_true=y_test, y_pred= dic['GaussianMixture']['predictions'], classes=[0,1],
                          normalize=True,
                          title='Normalized Confusion Matrix for Gaussian Mixture',
                          cmap=plt.cm.Greens)


plot_confusion_matrix(y_true=y_test, y_pred= dic['DBSCAN']['predictions'], classes=[0,1],
                          normalize=True,
                          title='Normalized Confusion Matrix for DBSCAN',
                          cmap=plt.cm.Greens)
Normalized confusion matrix
[[0.8072    0.1928   ]
 [0.9796748 0.0203252]]
Normalized confusion matrix
[[0.9208     0.0792    ]
 [0.99593496 0.00406504]]
Normalized confusion matrix
[[0.034 0.966]
 [0.    1.   ]]
Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f200d03470>

Como vemos en el diccionario que nos devuelve la función, ninguno de estos métodos de aprendizaje no supervisado es demasiado bueno; el K-Means y GaussianMixture porque no captan suficiente fraude, y por lo tanto tienen un recall score y un AUC demasiado bajo, y el DBSCAN porque predice demasiado fraude, demasiadas anomalías, y por lo tanto tampoco es un buen modelo pese a su buen recall score (en el AUC vemos que es bastante peor que el IsolationForest).

In [67]:
res['DBSCAN'] = {}

res['DBSCAN']['preds'] = dic['DBSCAN']['predictions']
In [68]:
res.keys()
Out[68]:
dict_keys(['SVC', 'ISOFO1', 'ISOFO2', 'ISOFO3', 'DBSCAN'])

Local Outlier Factor

Para encontrar el número óptimo de vecinos que necesitamos para identificar mejor las anomalías (el fraude), vamos a iterar desde 3 a 30 vecinos, y vamos a ver con qué número de vecinos la clasificación que nos da el modelo deja un AUC más alto.

In [69]:
max_n = 30

auc = 0

opt_n = 0

for i in range(3, max_n):
    
    lof = LocalOutlierFactor(n_neighbors=i, contamination = contam, n_jobs = -1, novelty=True)
    
    fit = lof.fit(Xtrain, y = ytrain)
    
    preds = lof.predict(Xtest)
    
    preds = [1 if pred==-1 else 0 for pred in preds]
    
    auc_i = roc_auc_score(ytest, preds)
    
    if auc_i > auc:
        
        auc = auc_i
        
        opt_n = i
        
        
        
In [70]:
auc, opt_n
Out[70]:
(0.5311284552845529, 26)

Vemos que el número óptimo de vecinos es de 26, mientras que el AUC es de 0.53, bastante por debajo del obtenido por el Isonation Forest. Como vemos, ningún modelo no supervisado se acerca al rendimiento del Isolation Forest para detectar el fraude, por lo que nos quedaremos con este.

In [71]:
lof = LocalOutlierFactor(n_neighbors=26, contamination = contam, n_jobs = -1, novelty=True)

lof.fit(Xtrain, ytrain)

preds = lof.predict(Xtest)
In [72]:
preds = [1 if pred==-1 else 0 for pred in preds]
In [73]:
res['LOF'] = {}
res['LOF']['preds'] = preds
In [74]:
plot_confusion_matrix(y_true=y_test, y_pred= preds, classes=[0,1],
                          normalize=True,
                          title='Normalized Confusion Matrix for LOF',
                          cmap=plt.cm.Greens)
Normalized confusion matrix
[[0.8712     0.1288    ]
 [0.80894309 0.19105691]]
Out[74]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f200c96d68>

VISUALIZATION OF UNSUPERVISED LEARNING METHODS' PERFORMANCE

Create a visualization showing the performance of this model over the test data.
In [139]:
%matplotlib inline
fraud_index = train.index[train["Class"]==1].tolist()
plt.rcParams['figure.figsize'] = (12, 9)

pca = PCA(n_components = 2)
In [140]:
components = pd.DataFrame(pca.fit_transform(Xtrain))
In [141]:
components['Fraud'] = y_train
In [142]:
cmap = plt.cm.PiYG
In [143]:
N = 2

fig, ax = plt.subplots(1,1, figsize=(10,8))
cmaplist = [cmap(i) for i in range(cmap.N)]
# create the new map
cmap = cmap.from_list('Custom cmap', cmaplist, cmap.N)

# define the bins and normalize
bounds = np.linspace(0,N-1 ,N+1)
norm = mpl.colors.BoundaryNorm(bounds, cmap.N)

scat = ax.scatter(components.loc[:, 0],components.loc[:, 1],\
                  c=components['Fraud'],\
                  #s=np.random.randint(100,500,N),\
                  cmap=cmap, norm=norm)
# create the colorbar
cb = plt.colorbar(scat, spacing='proportional',ticks=bounds)
cb.set_label('Custom cbar')
ax.set_title('')
plt.show()

Antes de entrar a visualizar las clasificaciones que han hecho nuestros modelos de aprendizaje no supervisado, vamos a ver, para entender mejor el funcionamiento interno de cada modelo y cómo separa los datos, una visualización en la que entrenamos los modelos solo con los dos primeros componentes principales y visualizamos sus regiones de la función de decisión sobre los mismos; primero solo sobre el train set y posteriormente con los modelos entrenados en el train sobre el test set. Esto nos permitirá ver, por ejemplo, que la región de decisión para un SVC es lineal, y cuando los datos no se pueden separar perfectamente de forma lineal tendrá algunos problemas a la hora de identificar anomalías; en el caso del LOF veremos una región de decisión basada en gran parte en elipses, lo cual muestra su relación con la estadística gaussiana.

In [80]:
%matplotlib inline

names1 = ['SVC', 'LOF']

names2 = ['ISOLATION FOREST']

models1 = [
    
    OneClassSVM(cache_size=200, coef0=0.0, degree=3, gamma='scale',
      kernel='linear', max_iter=-1, nu=0.05, random_state=69,
      shrinking=True, tol=0.001, verbose=True),
    
     LocalOutlierFactor(n_neighbors=26, contamination = contam, n_jobs = -1, novelty=True)
    
     #DBSCAN(algorithm='auto', eps=0.5, leaf_size=30, metric='euclidean',
      #metric_params=None, min_samples=10, n_jobs=None, p=None),
    
    
]

models2 = [IsolationForest(behaviour='new', bootstrap=True,
        contamination=0.04689287075867327, max_features=0.8,
        max_samples=0.7, n_estimators=500, random_state=69,
        verbose=1)]
           
datasets = [train, test]
In [81]:
dec1 = IsolationForest(behaviour='new', bootstrap=True,
        contamination=0.04689287075867327, max_features=0.8,
        max_samples=0.7, n_estimators=500, random_state=69,
        verbose=1).fit(pd.DataFrame(PCA(n_components=2).fit_transform\
                                   (StandardScaler().fit_transform(train.drop("Class", axis = 1)))))\
        .decision_function(pd.DataFrame(PCA(n_components=2).fit_transform\
                                   (StandardScaler().fit_transform(train.drop("Class", axis = 1)))))

dec2 = IsolationForest(behaviour='new', bootstrap=True,
        contamination=0.04689287075867327, max_features=0.8,
        max_samples=0.7, n_estimators=500, random_state=69,
        verbose=1).fit(pd.DataFrame(PCA(n_components=2).fit_transform\
                                   (StandardScaler().fit_transform(train.drop("Class", axis = 1)))))\
        .decision_function(pd.DataFrame(PCA(n_components=2).fit_transform\
                                   (StandardScaler().fit_transform(test.drop("Class", axis = 1)))))
C:\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py:625: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)
C:\Anaconda3\lib\site-packages\sklearn\base.py:462: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.fit(X, **fit_params).transform(X)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.8s finished
C:\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py:625: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)
C:\Anaconda3\lib\site-packages\sklearn\base.py:462: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.fit(X, **fit_params).transform(X)
C:\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py:625: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)
C:\Anaconda3\lib\site-packages\sklearn\base.py:462: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.fit(X, **fit_params).transform(X)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s finished
In [82]:
dec = [dec1, dec2]
In [83]:
%matplotlib inline

names1 = ['SVC', 'LOF']

names2 = ['ISOLATION FOREST']

models1 = [
    
    OneClassSVM(cache_size=200, coef0=0.0, degree=3, gamma='scale',
      kernel='linear', max_iter=-1, nu=0.05, random_state=69,
      shrinking=True, tol=0.001, verbose=True),
    
     LocalOutlierFactor(n_neighbors=26, contamination = contam, n_jobs = -1, novelty=True)
    
     #DBSCAN(algorithm='auto', eps=0.5, leaf_size=30, metric='euclidean',
      #metric_params=None, min_samples=10, n_jobs=None, p=None),
    
    
]

models2 = [IsolationForest(behaviour='new', bootstrap=True,
        contamination=0.04689287075867327, max_features=0.8,
        max_samples=0.7, n_estimators=500, random_state=69,
        verbose=1)]
           
datasets = [train, test]

def plot_results(names, models, datasets, var_name, h = 0.02, mode="unsupervised", dec = None):
        
    i = 1
    
    figure = plt.figure(figsize=(27, 9), dpi = 50)
    
    for count, data in enumerate(datasets):
        
        pca = PCA(n_components=2)
        
        X, y = StandardScaler().fit_transform(data.drop(var_name, axis = 1)), data[var_name]

        X = pca.fit_transform(X)

        #X_train = pd.DataFrame(pca.fit_transform(X_train))

        #X_test = pd.DataFrame(pca.fit_transform(X_test))

        x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5

        y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5

        xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                             np.arange(y_min, y_max, h))

        cm = plt.cm.RdYlGn

        cm_bright = ListedColormap(['#0cff00','#FF0000'])

        ax = plt.subplot(len(datasets), len(models) + 1, i)

        if count == 0:
            ax.set_title("Input data")
        # Plot the training points
        ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cm_bright,
                   edgecolors='k')
        
        # Plot the testing points
        ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cm_bright, alpha=0.6,
                   edgecolors='k')
        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        i += 1 
    
        for name, clf in zip(names, models):
            
            ax = plt.subplot(len(datasets), len(models) + 1, i)
            
            if count == 0:
                
                if hasattr(clf, "fit"):
                    
                    X_, y_ = pd.DataFrame(pca.fit_transform(
                                          StandardScaler()\
                                           .fit_transform(datasets[count].drop(var_name, axis = 1)\
                                                         ))), datasets[count][var_name]
                    clf.fit(X_, y_)
                    
                else:
                    
                    clf.fit_predict(X, y)
                
            else:
                
                try:
                    X_, y_ = pd.DataFrame(pca.fit_transform(
                                          StandardScaler()\
                                           .fit_transform(datasets[count-1].drop(var_name, axis = 1)\
                                                         ))), datasets[count-1][var_name]
                    
                    clf.fit(X_, y_)
                    
                except Exception as e:
                    
                    print(e)
                    
                    X_, y_ = pd.DataFrame(pca.fit_transform\
                                          (StandardScaler()\
                                           .fit_transform(datasets[count-1].drop(var_name, axis = 1)\
                                                         )))
                    clf.fit_predict(X_, y_)
            
            if mode == "unsupervised":
                
                '''
                try:

                    preds = [1 if pred == -1 else 0 for pred in clf.predict(X)]

                except:

                    try:
                        
                        preds = [1 if dec == -1 else 0 for dec in clf.decision_function(X)]
                        
                    except:
                        
                        try:
                            
                            preds = [1 if pred == -1 else 0 for pred in clf.fit_predic(X)]
                            
                '''
                try:
                    score = roc_auc_score(y, -clf.decision_function(X))
                except:
                    score = roc_auc_score(y, [1 if pred == -1 else 0 for pred in clf.predict(X)])

            else:
                try:
                    score = roc_auc_score(y, clf.decision_function(X))
                except:
                    score = roc_auc_score(y, clf.predict_proba(X)[:, 1])
            
            # Plot the decision boundary. For that, we will assign a color to each
            # point in the mesh [x_min, x_max]x[y_min, y_max].
            if hasattr(clf, "decision_function"):
                #Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
                
                try:
                    
                    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
                except:
                    Z1 = clf.decision_function(np.c_[xx.ravel()[:int(xx.ravel().shape[0]/2)], \
                                                   yy.ravel()[:int(yy.ravel().shape[0]/2)]])
                    
                    Z2 = clf.decision_function(np.c_[xx.ravel()[int(xx.ravel().shape[0]/2):], \
                                                   yy.ravel()[int(yy.ravel().shape[0]/2):]])
                    
                    Z = np.concatenate(Z1, Z2)
            else:
                    
                Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

            # Put the result into a color plot
            Z = Z.reshape(xx.shape)
            ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)

            # Plot the training points
            ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cm_bright,
                       edgecolors='k')
            # Plot the testing points
            #ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
             #          edgecolors='k', alpha=0.6)

            ax.set_xlim(xx.min(), xx.max())
            ax.set_ylim(yy.min(), yy.max())
            ax.set_xticks(())
            ax.set_yticks(())
            if count == 0:
                ax.set_title(name)
            ax.text(xx.max() - .5, yy.min() + .8, ('%.2f' % score).lstrip('0'),
                    size=20, horizontalalignment='right')
            i += 1

    plt.tight_layout()
    plt.show()
    
    

Como estamos recibiendo un error constante de memoria, y aunque quede algo peor, vamos a dibujar por un lado los dos primeros modelos y por otro lado el Isolation Forest, a ver si de esta forma no colapsa.

In [84]:
plot_results(names = names1, models = models1, datasets=datasets, var_name = "Class")
C:\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py:625: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)
C:\Anaconda3\lib\site-packages\sklearn\base.py:462: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.fit(X, **fit_params).transform(X)
C:\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py:625: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)
C:\Anaconda3\lib\site-packages\sklearn\base.py:462: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.fit(X, **fit_params).transform(X)
C:\Anaconda3\lib\site-packages\sklearn\svm\classes.py:1175: DeprecationWarning: The random_state parameter is deprecated and will be removed in version 0.22.
  " be removed in version 0.22.", DeprecationWarning)
[LibSVM]
C:\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py:625: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)
C:\Anaconda3\lib\site-packages\sklearn\base.py:462: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.fit(X, **fit_params).transform(X)
C:\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py:625: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)
C:\Anaconda3\lib\site-packages\sklearn\base.py:462: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.fit(X, **fit_params).transform(X)
C:\Anaconda3\lib\site-packages\sklearn\svm\classes.py:1175: DeprecationWarning: The random_state parameter is deprecated and will be removed in version 0.22.
  " be removed in version 0.22.", DeprecationWarning)
[LibSVM]
C:\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py:625: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)
C:\Anaconda3\lib\site-packages\sklearn\base.py:462: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.fit(X, **fit_params).transform(X)
In [85]:
plot_results(names = names2, models = models2, datasets=datasets, var_name = "Class")
C:\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py:625: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)
C:\Anaconda3\lib\site-packages\sklearn\base.py:462: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.fit(X, **fit_params).transform(X)
C:\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py:625: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)
C:\Anaconda3\lib\site-packages\sklearn\base.py:462: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.fit(X, **fit_params).transform(X)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.6s finished
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-83-5ffb9fbda9ad> in plot_results(names, models, datasets, var_name, h, mode, dec)
    146 
--> 147                     Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
    148                 except:

C:\Anaconda3\lib\site-packages\sklearn\ensemble\iforest.py in decision_function(self, X)
    343 
--> 344         return self.score_samples(X) - self.offset_
    345 

C:\Anaconda3\lib\site-packages\sklearn\ensemble\iforest.py in score_samples(self, X)
    401 
--> 402         depths += _average_path_length(n_samples_leaf)
    403 

C:\Anaconda3\lib\site-packages\sklearn\ensemble\iforest.py in _average_path_length(n_samples_leaf)
    445         n_samples_leaf_shape = n_samples_leaf.shape
--> 446         n_samples_leaf = n_samples_leaf.reshape((1, -1))
    447         average_path_length = np.zeros(n_samples_leaf.shape)

MemoryError: 

During handling of the above exception, another exception occurred:

MemoryError                               Traceback (most recent call last)
<ipython-input-85-8504bf72f5f0> in <module>()
----> 1 plot_results(names = names2, models = models2, datasets=datasets, var_name = "Class")

<ipython-input-83-5ffb9fbda9ad> in plot_results(names, models, datasets, var_name, h, mode, dec)
    147                     Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
    148                 except:
--> 149                     Z1 = clf.decision_function(np.c_[xx.ravel()[:int(xx.ravel().shape[0]/2)],                                                    yy.ravel()[:int(yy.ravel().shape[0]/2)]])
    150 
    151                     Z2 = clf.decision_function(np.c_[xx.ravel()[int(xx.ravel().shape[0]/2):],                                                    yy.ravel()[int(yy.ravel().shape[0]/2):]])

C:\Anaconda3\lib\site-packages\sklearn\ensemble\iforest.py in decision_function(self, X)
    342         # an outlier:
    343 
--> 344         return self.score_samples(X) - self.offset_
    345 
    346     def score_samples(self, X):

C:\Anaconda3\lib\site-packages\sklearn\ensemble\iforest.py in score_samples(self, X)
    380         n_samples = X.shape[0]
    381 
--> 382         n_samples_leaf = np.zeros((n_samples, self.n_estimators), order="f")
    383         depths = np.zeros((n_samples, self.n_estimators), order="f")
    384 

MemoryError: 

Pese a los trucos que intentamos hacer nos sigue saliendo el MemoryError. Por lo tanto, los resultados del Isolation Forest los presentaremos de otra forma.

In [87]:
def plot_classification(results, test, metric = "Recall"):
    
    from matplotlib.colors import ListedColormap

    h=0.02
    
    X, y = test.drop('Class', axis = 1), test["Class"]
    
    comps = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
    
    x_min, x_max = comps[:, 0].min() - .5, comps[:, 0].max() + .5

    y_min, y_max = comps[:, 1].min() - .5, comps[:, 1].max() + .5

    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                             np.arange(y_min, y_max, h))

    
    i = 1
    
    figure = plt.figure(figsize=(27, 9), dpi = 50)
    
    cm = plt.cm.RdYlGn

    cm_bright = ListedColormap(['#0cff00','#FF0000'])

    ax = plt.subplot(1, len(results.keys())+1, i)

    if i == 1:
        ax.set_title("Input data")
        # Plot the training points
    ax.scatter(comps[:, 0], comps[:, 1], c=y, cmap=cm_bright,
                   edgecolors='k')
        
        # Plot the testing points
    ax.scatter(comps[:, 0], comps[:, 1], c=y, cmap=cm_bright, alpha=0.8,
                   edgecolors='k')
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())
        
    for model in results:
                     
        i += 1
        
        ypred = results[model]['preds']
        
        ax = plt.subplot(1, len(results.keys())+1, i)
        
        if metric == "Recall":
            
        
            score = recall_score(y, ypred)
            
        elif metric == "AUC":
            try:
                
                score = roc_auc_score(y, results[model]['dec_func'])
            except:
                score = roc_auc_score(y, results[model]['dec_func'][:, 1])
        
        ax.scatter(comps[:, 0], comps[:, 1], c = ypred, cmap=cm_bright,
                  alpha = 0.8, edgecolors='k')
        
        ax.set_xlim(comps[:, 0].min(), comps[:, 0].max())
        ax.set_ylim(comps[:, 1].min(), comps[:, 1].max())
        ax.set_xticks(())
        ax.set_yticks(())
        
        
        ax.set_title(model)
        ax.text(comps[:, 0].max() - .5, comps[:, 1].min() + .8, ('%.2f' % score).lstrip('0'),
                size=50, horizontalalignment='right')   
        
    plt.tight_layout()
    plt.show()    
        
        
In [88]:
plot_classification(res, test)

En cada uno de los dos gráficos que se pueden ver arriba, el primer gráfico de los mismos en ambos casos representa los datos tal cual. En el primero de los casos se presentan el train (arriba) y el test (abajo), representados sobre sus componentes principales. En rojo podemos ver los casos en los que no ha habido fraude y en azul vemos los casos en los que sí ha habido fraude. El mejor Isolation Forest en términos de AUC tiene un Recall Score de 0.67 (es el score más importante a la hora de prevenir el fraude, pues nos interesa sobre todo captar todos los verdaderos positivos, aunque esto sea a expensas de clasificar falsos positivos). En los casos de SVC y LOF son de 0.14 y de 0.93 respectivamente. Aunque pueda parecer que el LOF es el mejor (y en términos únicamente de Recall Score lo es), realmente clasifica prácticamente todo como fraude, por lo tanto no es tan sensible como el Isolation Forest, que como vemos hace una separación algo más fina, pues sus decision boundaries pueden tomar más formas, distintas a las que observamos del SVC (basado en separaciones lineales) o en el LOF (basado en distribución Gaussiana - de ahí la forma de elipse de su función de decisión sobre los puntos). Es importante mencionar, de todas formas, que los gráficos de arriba están hechos con los modelos entrenados sobre los dos primeros componentes principales (era la forma de pintar bonitas las regiones de decisión para los distintos treshold y demás), mientras que el Isolation Forest está entrenado sobre todos los datos; esto le da una ventaja a la hora de identificar los patrones de operaciones fraudulanetas.

In [89]:
####### INSERT YOUR CODE HERE

Supervised model

Let's check now whether we can improve the results using a supervised model, that is, a model that exploits the Class information available in the training data. Build an ensemble-based classification model that performs the best as possible, using only the data in the training set.

Vamos a hacer hyperparameter tuning con 3 modelos de Boosting a la vez: Random Forest, Adaboost, Gradient Boosting y Extreme Gradient Boosting (XGB). Utilizaremos inicialmente el AUC como métrica para medir nuestro error (o mejor dicho, nuestro grado de acierto, pues cuando mayor es el AUC, mejor clasificamos).

In [90]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.utils.multiclass import unique_labels
In [91]:
rf = RandomForestClassifier(warm_start=True)

rf_grid = {'n_estimators': [300, 500, 800, 1000],
          'max_depth':[3, 4, 6],
          'max_features': ['auto', 'sqrt'],
          'criterion': ['gini'],
          'bootstrap': [True],
          'n_jobs': [-1],
          'verbose': [1],
          'class_weight': [{0:1, 1:2}, None],
          'random_state': [69]}

ada = AdaBoostClassifier()

ada_grid = {'n_estimators': [100, 300, 500, 800, 1000],
           'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
           'random_state': [69]}

gb = GradientBoostingClassifier(warm_start=True)

gb_grid = {'n_estimators': [300, 500, 800, 1000],
          'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
          'subsample': [0.8, 1.0], 
          'max_depth':[3, 5],
          'random_state': [69],
          'validation_fraction': [0.1, 0.2]}

xgb = XGBClassifier()

xgb_grid = {'n_estimators': [300, 500, 800, 1000],
          'max_depth':[3, 5],
           'learning_rate': [0.01, 0.02, 0.05, 0.1],
           'n_jobs': [-1],
           'silent': [False],
           'colsample_bytree': [0.8, 1],
            'reg_lambda': [0.5, 1],
            'reg_alpha': [0, 0.5],
           'random_state': [69]}
In [92]:
totaldict = {'Random Forest': {'model': rf,
                              'params': rf_grid},
            'AdaBoost': {'model': ada,
                        'params': ada_grid},
            'GradientBoosting': {'model': gb,
                                'params': gb_grid},
            'XGB': {'model': xgb,
                   'params': xgb_grid}}
In [100]:
train = pd.read_csv('./data/fraud_train.csv', encoding = 'utf-8')
test = pd.read_csv('./data/fraud_test.csv', encoding = 'utf-8')
In [101]:
def get_supervised_results(models, train, test, var_name):
    
    Xtr, ytr = pd.DataFrame(StandardScaler().fit_transform(train.drop(var_name, axis = 1))), train[var_name]
    
    Xte, yte = pd.DataFrame(StandardScaler().fit_transform(test.drop(var_name, axis = 1))), test[var_name]
    
    resu = {}
    
    for model in models:
        
        resu[model] = {}
        
        gridse = GridSearchCV(estimator = models[model]["model"],
                             param_grid = models[model]['params'],
                             scoring = make_scorer(roc_auc_score, greater_is_better=True),
                             n_jobs = -1, verbose = 1, cv = 3)
        
        gridse.fit(Xtr, ytr)
        
        preds = gridse.predict(Xte)
        
        resu[model]['preds'] = preds
        
        resu[model]['model'] = gridse.best_estimator_
        
        if hasattr(gridse, "decision_function"):
            
            resu[model]['dec_func'] = gridse.decision_function(Xte)
            
            resu[model]['auc'] = roc_auc_score(ytest, gridse.decision_function(Xte))
        
        else:
            
            resu[model]['dec_func'] = gridse.predict_proba(Xte)
            
            try:
                resu[model]['auc'] = roc_auc_score(ytest, gridse.predict_proba(Xte))
                
            except:
                
                next
    
    return resu
                
            
            
    
    
In [102]:
results = get_supervised_results(models=totaldict, train=train, test=test, var_name = "Class")
Fitting 3 folds for each of 48 candidates, totalling 144 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 144 out of 144 | elapsed:  4.3min finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    0.8s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:    1.2s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 300 out of 300 | elapsed:    0.0s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 300 out of 300 | elapsed:    0.0s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 300 out of 300 | elapsed:    0.0s finished
Fitting 3 folds for each of 25 candidates, totalling 75 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done  75 out of  75 | elapsed:  3.4min finished
Fitting 3 folds for each of 160 candidates, totalling 480 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  7.4min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 11.1min
[Parallel(n_jobs=-1)]: Done 480 out of 480 | elapsed: 11.9min finished
Fitting 3 folds for each of 256 candidates, totalling 768 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  5.5min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 10.7min
[Parallel(n_jobs=-1)]: Done 768 out of 768 | elapsed: 19.7min finished
In [103]:
results['Random Forest']['auc'] = roc_auc_score(y_true = test['Class'], y_score=results['Random Forest']['dec_func'][:, 1])
In [104]:
results['XGB']['auc'] = roc_auc_score(y_true = test['Class'], y_score=results['XGB']['dec_func'][:, 1])  
In [105]:
results
Out[105]:
{'Random Forest': {'preds': array([1, 1, 1, ..., 0, 0, 0], dtype=int64),
  'model': RandomForestClassifier(bootstrap=True, class_weight={0: 1, 1: 2},
              criterion='gini', max_depth=6, max_features='auto',
              max_leaf_nodes=None, min_impurity_decrease=0.0,
              min_impurity_split=None, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=300, n_jobs=-1, oob_score=False, random_state=69,
              verbose=1, warm_start=True),
  'dec_func': array([[3.35167791e-04, 9.99664832e-01],
         [1.33482476e-04, 9.99866518e-01],
         [3.46681581e-03, 9.96533184e-01],
         ...,
         [9.81350858e-01, 1.86491425e-02],
         [9.85644098e-01, 1.43559024e-02],
         [9.85855899e-01, 1.41441007e-02]]),
  'auc': 0.9817609756097561},
 'AdaBoost': {'preds': array([1, 1, 1, ..., 0, 0, 0], dtype=int64),
  'model': AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
            learning_rate=0.05, n_estimators=1000, random_state=69),
  'dec_func': array([ 0.1068558 ,  0.1238743 ,  0.12367596, ..., -0.08274831,
         -0.3083044 , -0.33095651]),
  'auc': 0.9731658536585366},
 'GradientBoosting': {'preds': array([1, 1, 1, ..., 0, 0, 0], dtype=int64),
  'model': GradientBoostingClassifier(criterion='friedman_mse', init=None,
                learning_rate=0.05, loss='deviance', max_depth=3,
                max_features=None, max_leaf_nodes=None,
                min_impurity_decrease=0.0, min_impurity_split=None,
                min_samples_leaf=1, min_samples_split=2,
                min_weight_fraction_leaf=0.0, n_estimators=800,
                n_iter_no_change=None, presort='auto', random_state=69,
                subsample=1.0, tol=0.0001, validation_fraction=0.1,
                verbose=0, warm_start=True),
  'dec_func': array([  7.80864305,   9.21687067,   9.4032311 , ...,  -6.62160548,
         -10.43955885, -11.29690709]),
  'auc': 0.9750219512195122},
 'XGB': {'preds': array([1, 1, 1, ..., 0, 0, 0], dtype=int64),
  'model': XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
         colsample_bytree=0.8, gamma=0, learning_rate=0.05, max_delta_step=0,
         max_depth=3, min_child_weight=1, missing=None, n_estimators=300,
         n_jobs=-1, nthread=None, objective='binary:logistic',
         random_state=69, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
         seed=None, silent=False, subsample=1),
  'dec_func': array([[1.4101028e-02, 9.8589897e-01],
         [7.6654553e-03, 9.9233454e-01],
         [8.0295801e-03, 9.9197042e-01],
         ...,
         [9.8453373e-01, 1.5466283e-02],
         [9.9848819e-01, 1.5118006e-03],
         [9.9933219e-01, 6.6779897e-04]], dtype=float32),
  'auc': 0.9778991869918698}}
In [106]:
y_pred = results['Random Forest']['preds']

np.set_printoptions(precision=3)

class_names = [0, 1]

# Plot non-normalized confusion matrix
plot_confusion_matrix(y_test, y_pred, classes=class_names,
                      title='Confusion matrix for Random Forest, without normalization')

# Plot normalized confusion matrix
plot_confusion_matrix(y_test, y_pred, classes=class_names, normalize=True,
                      title='Normalized confusion matrix for Random Forest')

plt.show()
Confusion matrix, without normalization
[[5000    0]
 [  41  205]]
Normalized confusion matrix
[[1.    0.   ]
 [0.167 0.833]]
In [107]:
RocPlot(y_test, model=results['Random Forest']['dec_func'][:, 1])

Como vemos el Random Forest tiene un AUC muy alto, bastante superior a los modelos que habíamos visto hasta ahora; cuand miramos la matriz de confusión también descubrimos con agradable sorpresa que el modelo capta suficientemente bien el fraude sin incurrir den casos de predecir fraude sin que lo sea. Se le escapan sólo el 17% de los casos de fraude en el test al Random Forest

In [108]:
plot_confusion_matrix(y_true=y_test, y_pred= results['AdaBoost']['preds'], classes=[0,1],
                          normalize=True,
                          title='Normalized Confusion Matrix for AdaBoostClassifier',
                          cmap=plt.cm.Greens)
Normalized confusion matrix
[[1.    0.   ]
 [0.187 0.813]]
Out[108]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f2089a9780>
In [109]:
RocPlot(y_test, model=results['AdaBoost']['dec_func'])

Como vemos AdaBoost también funciona mejor que los métodos de aprendizaje no supervisado en general, captando la mayoría de los casos de fraude; se le escapan un 19% en el test; algo peor que Random Forest.

In [110]:
plot_confusion_matrix(y_true=y_test, y_pred= results['GradientBoosting']['preds'], classes=[0,1],
                          normalize=True,
                          title='Normalized Confusion Matrix for Gradient Boosting',
                          cmap=plt.cm.Greens)
Normalized confusion matrix
[[0.997 0.003]
 [0.146 0.854]]
Out[110]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f20d923908>
In [111]:
RocPlot(y_test, model=results['GradientBoosting']['dec_func'])

Como vemos en la matriz de confusión, Gradient Boosting predice aún más casos de fraude que los dos modelos de ensemble vistos anteriormente, dejándose sólo el 15% de los casos sin predecir. Sin embargo, tiene un AUC algo menor que Random Forest, mostrando que su función de decisión no es tan robusta quizás como la de los dos modelos anteriores. Vamos a ver qué tal resulta XGBoost.

In [112]:
plot_confusion_matrix(y_true=y_test, y_pred= results['XGB']['preds'], classes=[0,1],
                          normalize=True,
                          title='Normalized Confusion Matrix for XGB',
                          cmap=plt.cm.Greens)
Normalized confusion matrix
[[9.996e-01 4.000e-04]
 [1.667e-01 8.333e-01]]
Out[112]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f20d9e6240>
In [113]:
RocPlot(y_test, model=results['XGB']['dec_func'][:, 1])

Vemos que XGB tiene un AUC algo inferior a Random Forest, pero por encima de Adaboost o Gradient Boosting. Se deja sin predecir el mismo porcentaje de operaciones fraudulentas que Random Forest.

In [114]:
recalls = [recall_score(y_test, results[m]['preds']) for m in results]
In [115]:
models = [m for m in results]
models
Out[115]:
['Random Forest', 'AdaBoost', 'GradientBoosting', 'XGB']
In [116]:
rec_df = pd.DataFrame({'model': models,
                     'recall':recalls})
In [117]:
rec_df = rec_df.sort_values(by=['recall'], axis=0, ascending=False)
In [118]:
rec_df
Out[118]:
model recall
2 GradientBoosting 0.853659
0 Random Forest 0.833333
3 XGB 0.833333
1 AdaBoost 0.813008
Now create a visualization showing the performance of this supervised model on the test set, together with the unsupervised model. Has the performance improved after making use of the Class data?

Visualization of Supervised Models Results

In [119]:
plt.figure(figsize=(15,5))
sns.barplot(y="model", x="recall", data=rec_df)
Out[119]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f20dbf4550>
In [120]:
plot_classification(results, test)
In [121]:
plot_classification(results, test, metric = "AUC")

En los gráficos de arriba podemos ver que el modelo que mejor Recall Score tiene es el Gradient Boosting, seguido de Random Forest y XGB, y por último Adaboost. En el segundo de estos gráficos tenemos los input data, así como la clasificación que ha hecho cada uno de los modelos. Los puntos verdes en el Input Data son los que realmente no habían sido fraude, mientras que los rojos son los que sí han sido fraude. En los gráficos que se ven al lado del Input data podemos ver la clasificación que ha hecho cada uno de los modelos. Primero presentamos el Recall Score, y vemos que el que más tiene es como dijimos el Gradient Boosting con 0.85, XGB y RF tienen 0.83 y Adaboost tiene 0.81. Justo después de este gráfico tenemos uno similar en el que los comparamos en términos de AUC. Vemos que el AUC de todos es muy parecido, siendo Adaboost el que menos tiene.

In [122]:
auc_df = pd.DataFrame({'model': [m for m in results],
                      'auc': [results[m]['auc'] for m in results]})
In [123]:
plt.figure(figsize=(15,5))
sns.barplot(y="model", x="auc", data=auc_df)
Out[123]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f20e38bd30>

Para ver cómo son las regiones de la función de decisión de estos modelos, vamos a hacer lo mismo que hicimos en la parte de no supervisado, es decir vamos a dibujar las regiones (para los diferentes treshold) de la función de decisión que tendrían estos modelos usando sólo los dos primeros componentes principales.

Vemos que en términos de AUC los modelos son prácticamente iguales, y cuesta mucho distinguirles en función de esta métrica visualmente.

In [124]:
names = ['Random Forest','AdaBoost', 'Gradient Boosting', 'XGB']

models = [results[m]['model'] for m in results]
In [125]:
#models
In [ ]:
####### INSERT YOUR CODE HERE

Advanced Search and Stacking

In [126]:
from skopt import BayesSearchCV
from sklearn.model_selection import train_test_split
In [127]:
def get_supervised_results_bayes(models, train, test, var_name):
    
    X_train, y_train = pd.DataFrame(StandardScaler().fit_transform(train.drop(var_name, axis = 1))), train[var_name]
    
    X_test, y_test = pd.DataFrame(StandardScaler().fit_transform(test.drop(var_name, axis = 1))), test[var_name]
    
    resu = {}
    
    for model in models:
        
        resu[model] = {}
        
        gridse = BayesSearchCV(estimator = models[model]["model"],
                             search_spaces = models[model]['params'],
                             scoring = make_scorer(roc_auc_score, greater_is_better=True),
                             n_jobs = -1, verbose = 1, cv = 3, n_iter = 50, random_state = 69)
        
        gridse.fit(X_train, y_train)
        
        preds = gridse.predict(X_test)
        
        resu[model]['preds'] = preds
        
        resu[model]['model'] = gridse.best_estimator_
        
        if hasattr(gridse, "decision_function"):
            
            resu[model]['dec_func'] = gridse.decision_function(X_test)
            
            resu[model]['auc'] = roc_auc_score(y_test, gridse.decision_function(X_test))
        
        else:
            
            resu[model]['dec_func'] = gridse.predict_proba(X_test)
            
            try:
                resu[model]['auc'] = roc_auc_score(y_test, gridse.predict_proba(X_test))
                
            except:
                
                next
    
    return resu
                
            
            
In [128]:
rf = RandomForestClassifier(warm_start=True)

rf_grid = {'n_estimators': [300, 500, 800, 1000],
          'max_depth':[3, 4, 6],
          'max_features': ['auto', 'sqrt'],
          'criterion': ['gini'],
          'bootstrap': [True],
          'n_jobs': [-1],
          'verbose': [1],
           'random_state': [69]}

ada = AdaBoostClassifier()

ada_grid = {'n_estimators': [100, 300, 500, 800, 1000],
           'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
           'random_state': [69]}

gb = GradientBoostingClassifier(warm_start=True)

gb_grid = {'n_estimators': [300, 500, 800, 1000],
          'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
          'subsample': [0.8, 1.0], 
          'max_depth':[3, 5],
          'random_state': [69],
          'validation_fraction': [0.1, 0.2]}

xgb = XGBClassifier()

xgb_grid = {'n_estimators': [300, 500, 800, 1000],
          'max_depth':[3, 5],
           'learning_rate': [0.01, 0.02, 0.05, 0.1],
           'n_jobs': [-1],
           'silent': [False],
           'colsample_bytree': [0.8, 1],
            'reg_lambda': [0.5, 1],
            'reg_alpha': [0, 0.5],
           'random_state': [69]}


totaldict = {'Random Forest': {'model': rf,
                              'params': rf_grid},
            'AdaBoost': {'model': ada,
                        'params': ada_grid},
            'GradientBoosting': {'model': gb,
                                'params': gb_grid},
            'XGB': {'model': xgb,
                   'params': xgb_grid}}


resultsbayes = get_supervised_results_bayes(models = totaldict, train = train, test = test, var_name = "Class")
Fitting 3 folds for each of 1 candidates, totalling 3 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    4.4s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
Fitting 3 folds for each of 1 candidates, totalling 3 fits
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    9.1s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
Fitting 3 folds for each of 1 candidates, totalling 3 fits
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    8.9s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
Fitting 3 folds for each of 1 candidates, totalling 3 fits
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    1.6s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
Fitting 3 folds for each of 1 candidates, totalling 3 fits
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    1.8s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
Fitting 3 folds for each of 1 candidates, totalling 3 fits
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    4.9s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
Fitting 3 folds for each of 1 candidates, totalling 3 fits
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    5.5s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
Fitting 3 folds for each of 1 candidates, totalling 3 fits
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    5.9s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
Fitting 3 folds for each of 1 candidates, totalling 3 fits
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    1.8s finished
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-128-a4d842a0b28d> in <module>()
     48 
     49 
---> 50 resultsbayes = get_supervised_results_bayes(models = totaldict, train = train, test = test, var_name = "Class")

<ipython-input-127-ca645a3811af> in get_supervised_results_bayes(models, train, test, var_name)
     16                              n_jobs = -1, verbose = 1, cv = 3, n_iter = 50, random_state = 69)
     17 
---> 18         gridse.fit(X_train, y_train)
     19 
     20         preds = gridse.predict(X_test)

C:\Anaconda3\lib\site-packages\skopt\searchcv.py in fit(self, X, y, groups, callback)
    652                 optim_result = self._step(
    653                     X, y, search_space, optimizer,
--> 654                     groups=groups, n_points=n_points_adjusted
    655                 )
    656                 n_iter -= n_points

C:\Anaconda3\lib\site-packages\skopt\searchcv.py in _step(self, X, y, search_space, optimizer, groups, n_points)
    539 
    540         # get parameter values to evaluate
--> 541         params = optimizer.ask(n_points=n_points)
    542         params_dict = [point_asdict(search_space, p) for p in params]
    543 

C:\Anaconda3\lib\site-packages\skopt\optimizer\optimizer.py in ask(self, n_points, strategy)
    370                 opt._tell(x, (y_lie, t_lie))
    371             else:
--> 372                 opt._tell(x, y_lie)
    373 
    374         self.cache_ = {(n_points, strategy): X}  # cache_ the result

C:\Anaconda3\lib\site-packages\skopt\optimizer\optimizer.py in _tell(self, x, y, fit)
    484             with warnings.catch_warnings():
    485                 warnings.simplefilter("ignore")
--> 486                 est.fit(self.space.transform(self.Xi), self.yi)
    487 
    488             if hasattr(self, "next_xs_") and self.acq_func == "gp_hedge":

C:\Anaconda3\lib\site-packages\skopt\learning\gaussian_process\gpr.py in fit(self, X, y)
    193                 noise_level=self.noise, noise_level_bounds="fixed"
    194             )
--> 195         super(GaussianProcessRegressor, self).fit(X, y)
    196 
    197         self.noise_ = None

C:\Anaconda3\lib\site-packages\sklearn\gaussian_process\gpr.py in fit(self, X, y)
    230             optima = [(self._constrained_optimization(obj_func,
    231                                                       self.kernel_.theta,
--> 232                                                       self.kernel_.bounds))]
    233 
    234             # Additional runs are performed from log-uniform chosen initial

C:\Anaconda3\lib\site-packages\sklearn\gaussian_process\gpr.py in _constrained_optimization(self, obj_func, initial_theta, bounds)
    474         if self.optimizer == "fmin_l_bfgs_b":
    475             theta_opt, func_min, convergence_dict = \
--> 476                 fmin_l_bfgs_b(obj_func, initial_theta, bounds=bounds)
    477             if convergence_dict["warnflag"] != 0:
    478                 warnings.warn("fmin_l_bfgs_b terminated abnormally with the "

C:\Anaconda3\lib\site-packages\scipy\optimize\lbfgsb.py in fmin_l_bfgs_b(func, x0, fprime, args, approx_grad, bounds, m, factr, pgtol, epsilon, iprint, maxfun, maxiter, disp, callback, maxls)
    197 
    198     res = _minimize_lbfgsb(fun, x0, args=args, jac=jac, bounds=bounds,
--> 199                            **opts)
    200     d = {'grad': res['jac'],
    201          'task': res['message'],

C:\Anaconda3\lib\site-packages\scipy\optimize\lbfgsb.py in _minimize_lbfgsb(fun, x0, args, jac, bounds, disp, maxcor, ftol, gtol, eps, maxfun, maxiter, iprint, callback, maxls, **unknown_options)
    333             # until the completion of the current minimization iteration.
    334             # Overwrite f and g:
--> 335             f, g = func_and_grad(x)
    336         elif task_str.startswith(b'NEW_X'):
    337             # new iteration

C:\Anaconda3\lib\site-packages\scipy\optimize\lbfgsb.py in func_and_grad(x)
    283     else:
    284         def func_and_grad(x):
--> 285             f = fun(x, *args)
    286             g = jac(x, *args)
    287             return f, g

C:\Anaconda3\lib\site-packages\scipy\optimize\optimize.py in function_wrapper(*wrapper_args)
    298     def function_wrapper(*wrapper_args):
    299         ncalls[0] += 1
--> 300         return function(*(wrapper_args + args))
    301 
    302     return ncalls, function_wrapper

C:\Anaconda3\lib\site-packages\scipy\optimize\optimize.py in __call__(self, x, *args)
     61     def __call__(self, x, *args):
     62         self.x = numpy.asarray(x).copy()
---> 63         fg = self.fun(x, *args)
     64         self.jac = fg[1]
     65         return fg[0]

C:\Anaconda3\lib\site-packages\sklearn\gaussian_process\gpr.py in obj_func(theta, eval_gradient)
    222                 if eval_gradient:
    223                     lml, grad = self.log_marginal_likelihood(
--> 224                         theta, eval_gradient=True)
    225                     return -lml, -grad
    226                 else:

C:\Anaconda3\lib\site-packages\sklearn\gaussian_process\gpr.py in log_marginal_likelihood(self, theta, eval_gradient)
    432 
    433         if eval_gradient:
--> 434             K, K_gradient = kernel(self.X_train_, eval_gradient=True)
    435         else:
    436             K = kernel(self.X_train_)

C:\Anaconda3\lib\site-packages\sklearn\gaussian_process\kernels.py in __call__(self, X, Y, eval_gradient)
    688         """
    689         if eval_gradient:
--> 690             K1, K1_gradient = self.k1(X, Y, eval_gradient=True)
    691             K2, K2_gradient = self.k2(X, Y, eval_gradient=True)
    692             return K1 + K2, np.dstack((K1_gradient, K2_gradient))

C:\Anaconda3\lib\site-packages\sklearn\gaussian_process\kernels.py in __call__(self, X, Y, eval_gradient)
    763         if eval_gradient:
    764             K1, K1_gradient = self.k1(X, Y, eval_gradient=True)
--> 765             K2, K2_gradient = self.k2(X, Y, eval_gradient=True)
    766             return K1 * K2, np.dstack((K1_gradient * K2[:, :, np.newaxis],
    767                                        K2_gradient * K1[:, :, np.newaxis]))

C:\Anaconda3\lib\site-packages\skopt\learning\gaussian_process\kernels.py in __call__(self, X, Y, eval_gradient)
    389             raise ValueError(
    390                 "Expected X to have %d features, got %d" %
--> 391                 (X.shape, len(length_scale)))
    392 
    393         n_samples, n_dim = X.shape

TypeError: %d format: a number is required, not tuple

El error que nos da BayesSearchCV es un error interno que no podemos resolver nosotros, por lo tanto continuamos con la siguiente parte.

In [129]:
def stacking_model(models, train, test, var_name):
    
    X_train, X_val, y_train, y_val = train_test_split(StandardScaler().fit_transform(train.drop(var_name, axis = 1)),
                                                    train[var_name], test_size = 0.4, random_state = 69)
    
    Xtest, ytest = StandardScaler().fit_transform(test.drop(var_name, axis = 1)), test[var_name]
    
    val = pd.DataFrame(X_val).copy()
    
    te = pd.DataFrame(X_test).copy()
    
    
    for i, model in enumerate(models):
        
        model.fit(X_train, y_train)
        
        preds = model.predict(X_val)
        
        val['preds' + str(i)] = preds
        
        assert X_train.shape[1] == Xtest.shape[1]
        
        preds2 = model.predict(Xtest)
        
        te['preds'+str(i)] = preds2
        
    xgb = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
        colsample_bytree=0.8, gamma=0, learning_rate=0.05, max_delta_step=0,
        max_depth=3, min_child_weight=1, missing=None, n_estimators=300,
        n_jobs=-1, nthread=None, objective='binary:logistic',
        random_state=69, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
        seed=None, silent=False, subsample=1)
    
    xgb.fit(val, y_val)
    
    preds = xgb.predict(te)
    
    auc = roc_auc_score(ytest, xgb.predict_proba(te)[:, 1])
    
    recall = recall_score(ytest, preds)
    
    stack = {'preds': preds,
            'auc': auc,
            'recall': recall, 
            'model': xgb,
            'dec_func': xgb.predict_proba(te)[: , 1]}
    
    return stack
In [130]:
stack_results = stacking_model(models = models, train = train, test = test, var_name ="Class")
C:\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py:308: UserWarning: Warm-start fitting without increasing n_estimators does not fit new trees.
  warn("Warm-start fitting without increasing n_estimators does not "
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 300 out of 300 | elapsed:    0.0s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 300 out of 300 | elapsed:    0.0s finished
In [131]:
stack_results
Out[131]:
{'preds': array([1, 1, 1, ..., 0, 0, 0], dtype=int64),
 'auc': 0.969430081300813,
 'recall': 0.8536585365853658,
 'model': XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
        colsample_bytree=0.8, gamma=0, learning_rate=0.05, max_delta_step=0,
        max_depth=3, min_child_weight=1, missing=None, n_estimators=300,
        n_jobs=-1, nthread=None, objective='binary:logistic',
        random_state=69, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
        seed=None, silent=False, subsample=1),
 'dec_func': array([9.971e-01, 9.971e-01, 9.977e-01, ..., 1.081e-03, 4.937e-04,
        2.873e-04], dtype=float32)}
In [132]:
results['stacking'] = stack_results
In [133]:
auc_df = pd.DataFrame({'model': [m for m in results],
                      'auc': [results[m]['auc'] for m in results]}).sort_values(by=['auc'], axis=0, ascending=False)
In [134]:
plt.figure(figsize=(15,5))
sns.barplot(y="model", x="auc", data=auc_df)
Out[134]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f20fc12940>

Vemos que en términos de AUC el Stacking Model sería el peor, aunque con una diferencia muy pequeña sobre el resto. Vamos a ver ahora qué tal quedaría en términos de Recall.

In [135]:
rec_df = pd.DataFrame({'model': [m for m in results],
                      'recall': [recall_score(y_test, results[m]['preds']) for m in results]}).sort_values(by=['recall'],
                                                                                                          axis = 0,
                                                                                                          ascending = False)
In [136]:
plt.figure(figsize=(15,5))
sns.barplot(y="model", x="recall", data=rec_df)
Out[136]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f20dc508d0>

Sin embargo en términos de Recall los resultados son bastante buenos, similares a los de Gradient Boosting. Podemos determinar, por lo tanto, que los mejores modelos para luchar contra el fraude serían o bien el Gradient Boosting o bien el stacking model, pues estos son los que más veces predicen si el individuo va o no va a hacer fraude. En general los modelos de aprendizaje supervisado, que explotan la información contenida en la variable "Class" funcionan mejor que los modelos de aprendizaje no supervisado, donde para encontrar las anomalías no se cuenta con la información contenida en esta variable y por lo tanto tienen algo menos de conocimiento respecto al problema que se intenta resolver.

In [137]:
plot_classification(results=results, test = test, metric = "Recall")
In [138]:
plot_classification(results=results, test=test, metric="AUC")

Por último vamos a probar un modelo que llamaremos stacking2 que será básicamente coger el voto mayoritario de todos los modelos de ensemble, utilizando todas las predicciones.

In [145]:
results
Out[145]:
{'Random Forest': {'preds': array([1, 1, 1, ..., 0, 0, 0], dtype=int64),
  'model': RandomForestClassifier(bootstrap=True, class_weight={0: 1, 1: 2},
              criterion='gini', max_depth=6, max_features='auto',
              max_leaf_nodes=None, min_impurity_decrease=0.0,
              min_impurity_split=None, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=300, n_jobs=-1, oob_score=False, random_state=69,
              verbose=1, warm_start=True),
  'dec_func': array([[3.352e-04, 9.997e-01],
         [1.335e-04, 9.999e-01],
         [3.467e-03, 9.965e-01],
         ...,
         [9.814e-01, 1.865e-02],
         [9.856e-01, 1.436e-02],
         [9.859e-01, 1.414e-02]]),
  'auc': 0.9817609756097561},
 'AdaBoost': {'preds': array([1, 1, 1, ..., 0, 0, 0], dtype=int64),
  'model': AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
            learning_rate=0.05, n_estimators=1000, random_state=69),
  'dec_func': array([ 0.107,  0.124,  0.124, ..., -0.083, -0.308, -0.331]),
  'auc': 0.9731658536585366},
 'GradientBoosting': {'preds': array([1, 1, 1, ..., 0, 0, 0], dtype=int64),
  'model': GradientBoostingClassifier(criterion='friedman_mse', init=None,
                learning_rate=0.05, loss='deviance', max_depth=3,
                max_features=None, max_leaf_nodes=None,
                min_impurity_decrease=0.0, min_impurity_split=None,
                min_samples_leaf=1, min_samples_split=2,
                min_weight_fraction_leaf=0.0, n_estimators=800,
                n_iter_no_change=None, presort='auto', random_state=69,
                subsample=1.0, tol=0.0001, validation_fraction=0.1,
                verbose=0, warm_start=True),
  'dec_func': array([  7.809,   9.217,   9.403, ...,  -6.622, -10.44 , -11.297]),
  'auc': 0.9750219512195122},
 'XGB': {'preds': array([1, 1, 1, ..., 0, 0, 0], dtype=int64),
  'model': XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
         colsample_bytree=0.8, gamma=0, learning_rate=0.05, max_delta_step=0,
         max_depth=3, min_child_weight=1, missing=None, n_estimators=300,
         n_jobs=-1, nthread=None, objective='binary:logistic',
         random_state=69, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
         seed=None, silent=False, subsample=1),
  'dec_func': array([[1.410e-02, 9.859e-01],
         [7.665e-03, 9.923e-01],
         [8.030e-03, 9.920e-01],
         ...,
         [9.845e-01, 1.547e-02],
         [9.985e-01, 1.512e-03],
         [9.993e-01, 6.678e-04]], dtype=float32),
  'auc': 0.9778991869918698},
 'stacking': {'preds': array([1, 1, 1, ..., 0, 0, 0], dtype=int64),
  'auc': 0.969430081300813,
  'recall': 0.8536585365853658,
  'model': XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
         colsample_bytree=0.8, gamma=0, learning_rate=0.05, max_delta_step=0,
         max_depth=3, min_child_weight=1, missing=None, n_estimators=300,
         n_jobs=-1, nthread=None, objective='binary:logistic',
         random_state=69, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
         seed=None, silent=False, subsample=1),
  'dec_func': array([9.971e-01, 9.971e-01, 9.977e-01, ..., 1.081e-03, 4.937e-04,
         2.873e-04], dtype=float32)}}
In [150]:
def voto_mayoritario(models, ytest):
    
    preds_ = []
    
    for i in range(5246):
        
        pr = [models[m]['preds'][i] for m in models]
        
        if sum(pr) >= 3:
            
            preds_.append(1)
        else:
            preds_.append(0)
    
    recall = recall_score(ytest, preds_)
    
    re = {'recall': recall,
         'preds': preds_}
    
    return re 
    
In [151]:
stacking2 = voto_mayoritario(models=results, ytest= test['Class'])
In [154]:
results['stacking2'] = stacking2

Vemos que el recall con el voto mayoritario es algo inferior a los que habíamos obtenido antes. Podemos decir, finalmente, que nos quedamos por lo tanto preferiblemente con el Gradient Boosting, pues es el modelo que mejor Recall Score tenía siendo además alto su AUC (a diferencia del stacking anterior que presentamos). Como hemos visto, tanto visualmente como en números, es que en general los modelos de aprendizaje supervisado son mejores que los modelos de aprendizaje no supervisado a la hora de detectar el fraude. Podemos por tanto utilizar cualquiera de estos modelos (todos tienen una tasa de error muy baja) para luchar contra el fraude, siendo el Gradient Boosting el modelo que en teoría nos haría perder la menor cantidad de dinero; el stacking primero que presentamos parece que también pilla muy bien los casos de fraude, pero tiene un AUC más bajo, mostrando que lo hace a costa de predecir más veces fraude de las que toca. Estos errores, de todas formas, son más baratos para la compañía que el error del otro tipo, es decir predecir que alguien no va a hacer fraude y finalmente lo hace. Por este motivo nos hemos fijado mucho en el AUC y especialmente en el Recall Score a la hora de elegir los modelos, pues estamos especialmente interesados en captar los casos de fraude, aunque esto sea a costa de "acusar" incorrectamente a otra persona.

In [155]:
plot_classification(results = results, test = test, metric = "Recall")
In [156]:
rec_df = pd.DataFrame({'model': [m for m in results],
                      'recall': [recall_score(y_test, results[m]['preds']) for m in results]}).sort_values(by=['recall'],
                                                                                                          axis = 0,
                                                                                                          ascending = False)

plt.figure(figsize=(15,5))
sns.barplot(y="model", x="recall", data=rec_df)
Out[156]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f215d9f1d0>